Abstract



While statistics focusses on hypothesis testing and on estimating (proper- 
ties of) the true sampling distribution, in machine learning the performance of 
learning algorithms on future data is the primary issue. In this paper we bridge 
the gap with a general principle (PHI) that identifies hypotheses with best 
predictive performance. This includes predictive point and interval estimation, 
simple and composite hypothesis testing, (mixture) model selection, and 
others as special cases. For concrete instantiations we will recover well-known 
methods, variations thereof, and new ones. PHI nicely justifies, reconciles, 
and blends (a reparametrization invariant variation of) MAP, ML, MDL, and 
moment estimation. One particular feature of PHI is that it can genuinely 
deal with nested hypotheses. 



Keywords: parameter estimation; hypothesis testing; model selection; predictive inference; composite hypotheses; MAP versus ML; moment fitting; Bayesian statistics.
1 Introduction

Consider data $D$ sampled from some distribution $p(D|\theta)$ with unknown $\theta\in\Omega$. The likelihood function or the posterior contains the complete statistical information of the sample. Often this information needs to be summarized or simplified for various reasons (comprehensibility, communication, storage, computational efficiency, mathematical tractability, etc.). Parameter estimation, hypothesis testing, and model (complexity) selection can all be regarded as ways of summarizing this information, albeit in different ways or contexts. The posterior might either be summarized by a single point $\Theta=\{\theta\}$ (e.g. ML or MAP or mean or stochastic model selection), or by a convex set $\Theta\subseteq\Omega$ (e.g. confidence or credible interval), or by a finite set of points $\Theta=\{\theta_1,...,\theta_l\}$ (mixture models) or a sample of points (particle filtering), or by the mean and covariance matrix (Gaussian approximation), or by more general density estimation, or in a few other ways [BM98, Bis06]. I have roughly sorted the methods in increasing order of complexity. This paper concentrates on set estimation, which includes (multiple) point estimation and hypothesis testing as special cases, henceforth jointly referred to as "hypothesis identification" (this nomenclature seems uncharged and naturally includes what we will do: estimation and testing of simple and complex hypotheses but not density estimation). We will briefly comment on generalizations beyond set estimation at the end.

Desirable properties. There are many desirable properties any hypothesis identi- 
fication principle ideally should satisfy. It should 

• lead to good predictions (that's what models are ultimately for), 

• be broadly applicable, 

• be analytically and computationally tractable, 

• be defined and make sense also for non-i.i.d. and non-stationary data, 

• be reparametrization and representation invariant, 

• work for simple and composite hypotheses, 

• work for classes containing nested and overlapping hypotheses, 

• work in the estimation, testing, and model selection regime, 

• reduce in special cases (approximately) to existing other methods. 

Here we concentrate on the first item, and will show that the resulting principle nicely 
satisfies many of the other items. 

The main idea. We address the problem of identifying hypotheses (parameters/models) with good predictive performance head on. If $\theta_0$ is the true parameter, then $p(x|\theta_0)$ is obviously the best prediction of the $m$ future observations $x$. If we don't know $\theta_0$ but have prior belief $p(\theta)$ about its distribution, the predictive distribution $p(x|D)$ based on the past $n$ observations $D$ (which averages the likelihood $p(x|\theta)$ over $\theta$ with posterior weight $p(\theta|D)$) is by definition the best Bayesian predictor. Often we cannot use full Bayes (for reasons discussed above) but predict with hypothesis $H=\{\theta\in\Theta\}$, i.e. use $p(x|\Theta)$ as prediction. The closer $p(x|\Theta)$ is to $p(x|D)$ or $p(x|\theta_0,D)$ (see footnote 1), the better is $H$'s prediction (by definition), where we can measure closeness with some distance function $d$. Since $x$ and $\theta_0$ are (assumed to be) unknown, we have to sum or average over them.

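Definition 1 (Predictive Loss) The predictive Loss / $\widetilde{\text{Loss}}$ of $\Theta$ given $D$ based on distance $d$ for $m$ future observations is

$$\text{Loss}_d^m(\Theta,D) := \int d\big(p(x|\Theta),\,p(x|D)\big)\,dx, \qquad \widetilde{\text{Loss}}_d^m(\Theta,D) := \int\!\!\int d\big(p(x|\Theta),\,p(x|\theta,D)\big)\,p(\theta|D)\,dx\,d\theta \qquad (1)$$
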
Predictive hypothesis identification (PHI) minimizes the losses w.r.t. some hypothesis class $\mathcal H$. Our formulation is general enough to cover point and interval estimation, simple and composite hypothesis testing, (mixture) model (complexity) selection, and others.

(Un)related work. The general idea of inference by maximizing predictive performance is not new [Gei93]. Indeed, in the context of model (complexity) selection it is prevalent in machine learning and implemented primarily by empirical cross validation procedures and variations thereof [Zuc00] or by minimizing test and/or train set (generalization) bounds; see [Lan02] and references therein. There are also a number of statistics papers on predictive inference; see [Gei93] for an overview and older references, and [BB04, MGB05] for newer references. Most of them deal with distribution free methods based on some form of cross-validation discrepancy measure, and often focus on model selection. A notable exception is MLPD [LF82], which maximizes the predictive likelihood including future observations. The full decision-theoretic setup in which a decision based on $D$ leads to a loss depending on $x$, and minimizing the expected loss, has been studied extensively [BM98, Hut05], but scarcely in the context of hypothesis identification. On the natural progression of estimation → prediction → action, approximating the predictive distribution by minimizing (1) lies between traditional parameter estimation and optimal decision making. Formulation (1) is quite natural but I haven't seen it elsewhere. Indeed, besides ideological similarities the papers above bear no resemblance to this work.

Contents. The main purpose of this paper is to investigate the predictive losses above and in particular their minima, i.e. the best predictor in $\mathcal H$. Section 2 introduces notation, global assumptions, and illustrates PHI on a simple example. This also shows a shortcoming of MAP and ML estimation. Section 3 formally states PHI, possible distance and loss functions, and their minima. In Section 4 I study exact properties of PHI: invariances, sufficient statistics, and equivalences. Section 5 investigates the limit $m\to\infty$ in which PHI can be related to MAP and ML. Section 6 derives large sample approximations $n\to\infty$ for which PHI reduces to sequential moment fitting (SMF). The results are subsequently used for Offline PHI. Section 7 contains summary, outlook and conclusions. Throughout the paper, the Bernoulli example will illustrate the general results.

Footnote 1: So far we tacitly assumed that given $\theta_0$, $x$ is independent of $D$. For non-i.i.d. data this is generally not the case, hence the appearance of $D$.

The main aim of this paper is to introduce and motivate PHI, demonstrate how it 
can deal with the difficult problem of selecting composite and nested hypotheses, and 
show how PHI reduces to known principles in certain regimes. The latter provides 
additional justification and support of previous principles, and clarifies their range of 
applicability. In general, the treatment is exemplary, not exhaustive. 



2 Preliminaries 



Setup. Let $D\equiv D_n\equiv(x_1,...,x_n)\equiv x_{1:n}\in\mathcal X^n$ be the observed sample with observations $x_i\in\mathcal X$ from some measurable space $\mathcal X$, e.g. $\mathbb R^{d'}$ or $\mathbb N$ or a subset thereof. Similarly let $x\equiv(x_{n+1},...,x_{n+m})\equiv x_{n+1:n+m}\in\mathcal X^m$ be potential future observations. We assume that $D$ and $x$ are sampled from some probability distribution $P[\cdot|\theta]$, where $\theta\in\Omega$ is some unknown parameter. We do not assume independence of the $x_{i\in\mathbb N}$ unless otherwise stated. For simplicity of exposition we assume that the densities $p(D|\theta)$ w.r.t. the default (Lebesgue or counting) measure ($\int d\lambda$, $\sum_x$, written both henceforth as $\int dx$) exist.

Bayes. Similarly, we assume a prior distribution $P[\Theta]$ with density $p(\theta)$ over parameters. From prior $p(\theta)$ and likelihood $p(D|\theta)$ we can compute the posterior $p(\theta|D)=p(D|\theta)p(\theta)/p(D)$, where normalizer $p(D)=\int p(D|\theta)p(\theta)\,d\theta$. The full Bayesian approach uses parameter averaging for prediction

$$p(x|D) = \int p(x|\theta,D)\,p(\theta|D)\,d\theta = \frac{p(D,x)}{p(D)}$$

the so-called predictive distribution (or more precisely predictive density), which can be regarded as the gold standard for prediction (and there are plenty of results justifying this [BCH93, Hut05]).

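Composite likelihood. Let $H_\theta$ be the simple hypothesis that $x$ is sampled from $p(x|\theta)$ and $H_\Theta$ the composite hypothesis that $x$ is "sampled" from $p(x|\Theta)$, where $\Theta\subseteq\Omega$. In the Bayesian framework, the "composite likelihood" $p(x|\Theta)$ is actually well defined (for measurable $\Theta$ with $P[\Theta]>0$) as an averaged likelihood

$$p(x|\Theta) = \int p(x|\theta)\,p(\theta|\Theta)\,d\theta, \quad\text{where}\quad p(\theta|\Theta) = \frac{p(\theta)}{P[\Theta]} \text{ for } \theta\in\Theta \text{ and } 0 \text{ else.}$$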



MAP and ML. Let $\mathcal H$ be the (finite, countable, continuous, complete, or else) class of hypotheses $H_\Theta$ (or $\Theta$ for short) from which the "best" one shall be selected. Each $\Theta\in\mathcal H$ is assumed to be a measurable subset of $\Omega$. The maximum a posteriori (MAP) estimator is defined as $\theta^{\rm MAP}=\arg\max_{\theta\in\mathcal H}p(\theta|D)$ if $\mathcal H$ contains only simple hypotheses and $\Theta^{\rm MAP}=\arg\max_{\Theta\in\mathcal H}P[\Theta|D]$ in the general case. The composite maximum likelihood estimator is defined as $\Theta^{\rm ML}=\arg\max_{\Theta\in\mathcal H}p(D|\Theta)$, which reduces to ordinary ML for simple hypotheses.

In order not to further clutter up the text with too much mathematical gibberish, 
we make the following global assumptions during informal discussions: 

Global Assumption 2 Wherever necessary, we assume that sets, spaces, and func- 
tions are measurable, densities exist w.r.t. some (Lebesgue or counting) base measure, 
observed events have non-zero probability, or densities conditioned on probability zero 
events are appropriately defined, in which case statements might hold with probability 
1 only. Functions and densities are sufficiently often (continuously) differentiable, 
and integrals exist and exchange. 



Bernoulli Example. Consider a binary $\mathcal X=\{0,1\}$ i.i.d. process $p(D|\theta)=\theta^{n_1}(1-\theta)^{n_0}$ with bias $\theta\in[0,1]=\Omega$, and $n_1=x_1+...+x_n=n-n_0$ the number of observed 1s. Let us assume a uniform prior $p(\theta)=1$. Here but not generally in later continuations of the example we also assume $n_0=n_1$. Consider hypothesis class $\mathcal H=\{H_f,H_v\}$ containing simple hypothesis $H_f=\{\theta=\frac12\}$ meaning "fair" and composite vacuous alternative $H_v=\Omega$ meaning "don't know". It is easy to see that

$$p(D|H_v) = p(D) = \frac{n_1!\,n_0!}{(n+1)!} < 2^{-n} = p(D|H_f) \quad\text{for } n>1 \text{ (and $=$ else)}$$

hence $\Theta^{\rm ML}=H_f$, i.e. ML always suggests a fair coin however weak the evidence is. On the other hand, $P[H_f|D]=0<1=P[H_v|D]$, i.e. MAP never suggests a fair coin however strong the evidence is.

Now consider PHI. Let $m_1=x_{n+1}+...+x_{n+m}=m-m_0$ be the number of future 1s. The probabilities of $x$ given $H_f$, $H_v$, and $D$ are, respectively

$$p(x|H_f)=2^{-m},\qquad p(x|H_v)=\frac{m_1!\,m_0!}{(m+1)!},\qquad p(x|D)=\frac{(m_1+n_1)!\,(m_0+n_0)!}{(n+m+1)!}\cdot\frac{(n+1)!}{n_1!\,n_0!} \qquad (2)$$

For $m=1$ we get $p(1|H_f)=\frac12=p(1|H_v)$, so when concerned with predicting only one bit, both hypotheses are equally good. More generally, for an interval $\Theta=[a,b]$, compare $p(1|\Theta)=\bar\theta:=\frac12(a+b)$ to the full Bayesian prediction $p(1|D)=\frac{n_1+1}{n+2}$ (Laplace's rule). Hence if $\mathcal H$ is a class of interval hypotheses, then PHI chooses the $\Theta\in\mathcal H$ whose midpoint $\bar\theta$ is closest to Laplace's rule, which is reasonable. The size of the interval doesn't matter, since $p(x_{n+1}|\Theta)$ is independent of it.

Things start to change for $m=2$. The following table lists $p(x|D)$ for some $D$, together with $p(x|H_f)$ and $p(x|H_v)$, and their prediction error $\text{Err}(H):=\text{Loss}_1^2(H,D)$ for $d(p,q)=|p-q|$ in (1). The column $x=01|10$ gives the combined probability of the two mixed sequences.

| $p(x|D)$ | $x=00$ | $x=01|10$ | $x=11$ | $\text{Err}(H_f)$ vs. $\text{Err}(H_v)$ | Conclusion |
|---|---|---|---|---|---|
| $D=\{\}$ | 1/3 | 1/3 | 1/3 | 1/3 > 0 | don't know |
| $D=01$ | 3/10 | 4/10 | 3/10 | 1/5 > 2/15 | don't know |
| $D=0101$ | 2/7 | 3/7 | 2/7 | 1/7 < 4/21 | fair |
| $D=(01)^\infty$ | 1/4 | 1/2 | 1/4 | 0 < 1/3 | fair |
| $p(x|H_f)$ | 1/4 | 1/2 | 1/4 | | |
| $p(x|H_v)$ | 1/3 | 1/3 | 1/3 | | |

The last column contains the identified best predictive hypothesis. For four or more observations, PHI says "fair", otherwise "don't know".

Using (2) or our later results, one can show more generally that PHI chooses "fair" for $n\gg m$ and "don't know" for $m\gg n$. ◊

MAP versus ML versus PHI. The conclusions of the example generalize: For $\Theta_1\subseteq\Theta_2$, we have $P[\Theta_1|D]\le P[\Theta_2|D]$, i.e. MAP always chooses the less specific hypothesis $H_{\Theta_2}$. On the other hand, we have $p(D|\theta^{\rm ML})\ge p(D|\Theta)$, since the maximum can never be smaller than an average, i.e. composite ML prefers the maximally specific hypothesis. So interestingly, although MAP and ML give identical answers for uniform prior on simple hypotheses, their naive extensions to composite hypotheses are diametrically opposed. While MAP is risk averse, finding a likely true model of low predictive power, composite ML risks an (over)precise prediction. Sure, there are ways to make MAP and ML work for nested hypotheses. The Bernoulli example has also shown that PHI's answer depends not only on the past data size $n$ but also on the future data size $m$. Indeed, if we make only few predictions based on a lot of data ($m\ll n$), a point estimation ($H_f$) is typically sufficient, since there will not be enough future observations to detect any discrepancy. On the other hand, if $m\gg n$, selecting a vacuous model that ignores past data is better than selecting a potentially wrong parameter, since there is plenty of future data to learn from. This is exactly the behavior PHI exhibited in the example.
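
The finite-data rows of the $m=2$ table in the Bernoulli example can be checked mechanically. The following is a minimal sketch, assuming a uniform prior and the absolute distance $d(p,q)=|p-q|$; the function names (`p_given_data`, `abs_loss`, ...) are illustrative and not from the paper. Exact rational arithmetic keeps the comparison between the two errors free of rounding.

```python
from fractions import Fraction
from math import comb, factorial

def p_given_data(m1, m0, n1, n0):
    """p(x|D) for one future sequence x with m1 ones and m0 zeros,
    after observing n1 ones and n0 zeros (uniform prior on the bias)."""
    n, m = n1 + n0, m1 + m0
    return Fraction(
        factorial(m1 + n1) * factorial(m0 + n0) * factorial(n + 1),
        factorial(n + m + 1) * factorial(n1) * factorial(n0),
    )

def p_fair(m1, m0):
    """p(x|H_f): every sequence of length m has probability 2^-m."""
    return Fraction(1, 2 ** (m1 + m0))

def p_vacuous(m1, m0):
    """p(x|H_v): the prior predictive, i.e. p(x|D) with empty D."""
    return p_given_data(m1, m0, 0, 0)

def abs_loss(hypothesis, n1, n0, m):
    """Sum of |p(x|H) - p(x|D)| over all 2^m future sequences, grouping
    the comb(m, m1) sequences that share the same number of ones."""
    return sum(
        comb(m, m1) * abs(hypothesis(m1, m - m1) - p_given_data(m1, m - m1, n1, n0))
        for m1 in range(m + 1)
    )

for label, n1, n0 in [("{}", 0, 0), ("01", 1, 1), ("0101", 2, 2)]:
    err_f = abs_loss(p_fair, n1, n0, 2)
    err_v = abs_loss(p_vacuous, n1, n0, 2)
    verdict = "fair" if err_f < err_v else "don't know"
    print(f"D={label}: Err(H_f)={err_f}, Err(H_v)={err_v} -> {verdict}")
```

The row $D=(01)^\infty$ is a limit and is therefore not computed here.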

3 Predictive Hypothesis Identification Principle 

We have already defined the predictive loss functions in (1). We now formally state our predictive hypothesis identification (PHI) principle, discuss possible distances $d$, and major prediction scenarios related to the choice of $m$.

Distance functions. Throughout this work we assume that $d$ is continuous and zero if and only if both arguments coincide. Some popular distances are: the (f) $f$-divergence $d(p,q)=f(p/q)\,q$ for convex $f$ with $f(1)=0$, the ($\alpha$) $\alpha$-distance $f(t)=|t^\alpha-1|^{1/\alpha}$, the (1) absolute deviation $d(p,q)=|p-q|$ ($\alpha=1$), the (h) Hellinger distance $d(p,q)=(\sqrt p-\sqrt q)^2$ ($\alpha=\frac12$), the (c) chi-square distance $f(t)=(t-1)^2$, the (k) KL-divergence $f(t)=t\ln t$, and the (r) reverse KL-divergence $f(t)=-\ln t$. The only distance considered here that is not an $f$-divergence is the (2) squared distance $d(p,q)=(p-q)^2$. The $f$-divergence is particularly interesting, since it contains most of the standard distances and makes Loss representation invariant (RI).
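
To make the list of distances concrete, here is a small sketch of the pointwise distances as Python functions. The generator $|t^\alpha-1|^{1/\alpha}$ used for the $\alpha$-distance is an assumption, chosen because it reproduces the absolute deviation for $\alpha=1$ and the Hellinger distance for $\alpha=\frac12$; all function names are illustrative.

```python
import math

def f_divergence(f):
    """Turn a convex f with f(1) = 0 into the pointwise distance d(p, q) = f(p/q) * q."""
    return lambda p, q: f(p / q) * q

def alpha_generator(alpha):
    # Assumed generator of the alpha-distance: f(t) = |t^alpha - 1|^(1/alpha).
    return lambda t: abs(t ** alpha - 1) ** (1 / alpha)

absolute = f_divergence(alpha_generator(1.0))       # |p - q|
hellinger = f_divergence(alpha_generator(0.5))      # (sqrt(p) - sqrt(q))^2
chi_square = f_divergence(lambda t: (t - 1) ** 2)   # (p - q)^2 / q
kl = f_divergence(lambda t: t * math.log(t))        # p ln(p/q)
reverse_kl = f_divergence(lambda t: -math.log(t))   # q ln(q/p)

def squared(p, q):
    """Squared distance; the only one here that is not an f-divergence."""
    return (p - q) ** 2

p, q = 0.3, 0.5
print(absolute(p, q), hellinger(p, q), chi_square(p, q), kl(p, q), reverse_kl(p, q), squared(p, q))
```

Every $f$-divergence scales linearly, $d(ap,aq)=a\,d(p,q)$, while the squared distance scales quadratically; this is the scaling property used later for the sufficient-statistic reduction.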



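Definition 3 (Predictive hypothesis identification (PHI)) The best ($\widetilde{\text{best}}$) predictive hypothesis in $\mathcal H$ given $D$ is defined as

$$\hat\Theta_d^m := \arg\min_{\Theta\in\mathcal H}\text{Loss}_d^m(\Theta,D) \qquad \Big(\ \tilde\Theta_d^m := \arg\min_{\Theta\in\mathcal H}\widetilde{\text{Loss}}_d^m(\Theta,D)\ \Big)$$

The PHI ($\widetilde{\text{PHI}}$) principle states to predict $x$ with probability $p(x|\hat\Theta_d^m)$ ($p(x|\tilde\Theta_d^m)$), which we call PHI$_d^m$ ($\widetilde{\text{PHI}}_d^m$) prediction.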

Prediction modes. There exist a few distinct prediction scenarios and modes. Here are prototypes of the presumably most important ones: Infinite batch: Assume we summarize our data $D$ by a model/hypothesis $\Theta\in\mathcal H$. The model is henceforth used as background knowledge for predicting and learning from further observations essentially indefinitely. This corresponds to $m\to\infty$. Finite batch: Assume the scenario above, but terminate after $m$ predictions for whatever reason. This corresponds to a finite $m$ (often large). Offline: The selected model $\Theta$ is used for predicting $x_{k+1}$ for $k=n,...,n+m-1$ separately with $p(x_{k+1}|\Theta)$ without further learning from $x_{n+1}...x_k$ taking place. This corresponds to repeated $m=1$ with common $\Theta$: $\text{Loss}_d^{1m}(\Theta,D):=E[\sum_{k=n}^{n+m-1}\text{Loss}_d^1(\Theta,D_k)\,|\,D]$. Online: At every step $k=n,...,n+m-1$ we determine a (good) hypothesis $\Theta_k$ from $\mathcal H$ based on past data $D_k$, and use it only once for predicting $x_{k+1}$. Then for $k+1$ we select a new hypothesis etc. This corresponds to repeated $m=1$ with different $\Theta$: $\text{Loss}=\sum_{k=n}^{n+m-1}\text{Loss}_d^1(\Theta_k,D_k)$.

The above list is not exhaustive. Other prediction scenarios are definitely possible. In all prediction scenarios above we can use $\widetilde{\text{Loss}}$ instead of Loss equally well. Since all time steps $k$ in Online PHI are completely independent, online PHI reduces to 1-Batch PHI, hence will not be discussed any further.

4 Exact Properties of PHI 

Reparametrization and representation invariance (RI). An important sanity check of any statistical procedure is its behavior under reparametrization $\theta\leadsto\vartheta=g(\theta)$ [KW96] and/or when changing the representation of observations $x_i\leadsto y_i=h(x_i)$ [Wal96], where $g$ and $h$ are bijections. If the parametrization/representation is judged irrelevant to the problem, any inference should also be independent of it. MAP and ML are both representation invariant, but (for point estimation) only ML is reparametrization invariant.

Proposition 4 (Invariance of Loss) $\text{Loss}_d^m(\Theta,D)$ and $\widetilde{\text{Loss}}_d^m(\Theta,D)$ are invariant under reparametrization of $\Omega$. If distance $d$ is an $f$-divergence, then they are also independent of the representation of the observation space $\mathcal X$. For continuous $\mathcal X$, the transformations are assumed to be continuously differentiable.

RI for $\text{Loss}_f^m$ is obvious, but we will see later some interesting consequences. Any exact inference or any specialized form of PHI$_f$ will inherit RI. Similarly for approximations, as long as they do not break RI. For instance, PHI$_h$ will lead to an interesting RI variation of MAP.

Sufficient statistic. For large $m$, the integral in Definition 1 is prohibitive. Many models (the whole exponential family) possess a sufficient statistic which allows us to reduce the integral over $\mathcal X^m$ to an integral over the sufficient statistic. Let

$$T:\mathcal X^m\to\mathbb R^{d'} \text{ be a sufficient statistic, i.e. } p(x|T(x),\theta)=p(x|T(x))\ \ \forall x,\theta \qquad (3)$$

which implies that there exist functions $g$ and $h$ such that the likelihood factorizes into

$$p(x|\theta) = h(x)\,g(T(x)|\theta) \qquad (4)$$

The proof is trivial for discrete $\mathcal X$ (choose $h(x)=p(x|T(x))$ and $g(t|\theta)=p(t|\theta):=P[T(x)=t|\theta]$) and follows from Fisher's factorization theorem for continuous $\mathcal X$. Let $A$ be an event that is independent of $x$ given $\theta$. Then multiplying (4) by $p(\theta|A)$ and integrating over $\theta$ yields

$$p(x|A) = \int p(x|\theta)\,p(\theta|A)\,d\theta = h(x)\,g(T(x)|A), \quad\text{where} \qquad (5)$$

$$g(t|A) := \int g(t|\theta)\,p(\theta|A)\,d\theta \qquad (6)$$

For some $\beta\in\mathbb R$ let the (non-probability) measure $\mu_\beta[B]:=\int_{\{x:T(x)\in B\}}h(x)^\beta\,dx$ ($B\subseteq\mathbb R^{d'}$) have density $h_\beta(t)$ ($t\in\mathbb R^{d'}$) w.r.t. the (Lebesgue or counting) base measure $dt$ ($\int dt=\sum_t$ in the discrete case). Informally,

$$h_\beta(t) := \int h(x)^\beta\,\delta(T(x)-t)\,dx \qquad (7)$$

where $\delta$ is the Dirac delta for continuous $\mathcal X$ (or the Kronecker delta for countable $\mathcal X$, i.e. $\int dx\,\delta(T(x)-t)=\sum_{x:T(x)=t}$).

Theorem 5 (PHI for sufficient statistic) Let $T(x)$ be a sufficient statistic (3) for $\theta$ and assume $x$ is independent of $D$ given $\theta$, i.e. $p(x|\theta,D)=p(x|\theta)$. Then

$$\text{Loss}_d^m(\Theta,D) = \int d\big(g(t|\Theta),\,g(t|D)\big)\,h_\beta(t)\,dt$$
$$\widetilde{\text{Loss}}_d^m(\Theta,D) = \int\!\!\int d\big(g(t|\Theta),\,g(t|\theta)\big)\,h_\beta(t)\,p(\theta|D)\,dt\,d\theta$$

holds (where $g$ and $h_\beta$ have been defined in (4), (6), and (7)), provided one (or both) of the following conditions hold: (i) distance $d$ scales with a power $\beta\in\mathbb R$, i.e. $d(\alpha p,\alpha q)=\alpha^\beta d(p,q)$ for $\alpha>0$, or (ii) any distance $d$, but $h(x)\equiv1$ in (4). One can choose $g(t|\cdot)=p(t|\cdot)$, the probability density of $t$, in which case $h_1(t)\equiv1$.

All distances defined in Section 3 satisfy (i), the $f$-divergences all with $\beta=1$ and the square loss with $\beta=2$. The independence assumption is rather strong. In practice, usually it only holds for some $n$ if it holds for all $n$. Independence of $x_{n+1:n+m}$ from $D_n$ given $\theta$ for all $n$ can only be satisfied for independent (not necessarily identically distributed) $x_{i\in\mathbb N}$.
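
The reduction to the sufficient statistic can be sanity-checked by brute force on a small i.i.d. Bernoulli problem. The sketch below is an illustration under stated assumptions: a uniform prior, a point hypothesis $\Theta=\{\theta\}$, the choice $g(t|\cdot)=p(t|\cdot)$ with $T(x)$ the number of ones (so that $h(x)=1/\binom{m}{t}$ and $h_\beta(t)=\binom{m}{t}^{1-\beta}$), and hypothetical function names. It compares the sum over all $2^m$ sequences with the sum over the $m+1$ values of $t$.

```python
from fractions import Fraction
from itertools import product
from math import comb, factorial

def p_seq_given_data(t, m, n1, n0):
    """p(x|D) for one sequence with t ones among m, uniform prior, data counts (n1, n0)."""
    n = n1 + n0
    return Fraction(
        factorial(t + n1) * factorial(m - t + n0) * factorial(n + 1),
        factorial(n + m + 1) * factorial(n1) * factorial(n0),
    )

def brute_force_loss(d, theta, m, n1, n0):
    """Loss as a sum of d(p(x|theta), p(x|D)) over all 2^m sequences x."""
    total = Fraction(0)
    for x in product((0, 1), repeat=m):
        t = sum(x)
        total += d(theta ** t * (1 - theta) ** (m - t), p_seq_given_data(t, m, n1, n0))
    return total

def reduced_loss(d, beta, theta, m, n1, n0):
    """Same loss as a sum over the sufficient statistic t with weight h_beta(t)."""
    total = Fraction(0)
    for t in range(m + 1):
        c = comb(m, t)
        g_theta = c * theta ** t * (1 - theta) ** (m - t)   # p(t|theta)
        g_data = c * p_seq_given_data(t, m, n1, n0)          # p(t|D)
        h_beta = Fraction(1, c) ** (beta - 1)                # comb(m, t)^(1 - beta)
        total += d(g_theta, g_data) * h_beta
    return total

absolute = lambda p, q: abs(p - q)    # scales with beta = 1
squared = lambda p, q: (p - q) ** 2   # scales with beta = 2

theta = Fraction(1, 3)
print(brute_force_loss(absolute, theta, 4, 2, 1), reduced_loss(absolute, 1, theta, 4, 2, 1))
print(brute_force_loss(squared, theta, 4, 2, 1), reduced_loss(squared, 2, theta, 4, 2, 1))
```

The brute-force sum has $2^m$ terms and the reduced sum only $m+1$, which is the practical point of the theorem.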

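Theorem 6 (Equivalence of PHI$_{2|r}^m$ and $\widetilde{\text{PHI}}_{2|r}^m$) For square distance ($d=2$) and RKL distance ($d=r$), $\text{Loss}_d^m(\Theta,D)$ differs from $\widetilde{\text{Loss}}_d^m(\Theta,D)$ only by an additive constant $c_d^m(D)$ independent of $\Theta$, hence PHI and $\widetilde{\text{PHI}}$ select the same hypotheses $\hat\Theta_2^m=\tilde\Theta_2^m$ and $\hat\Theta_r^m=\tilde\Theta_r^m$.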

Bernoulli Example. Let us continue with our Bernoulli example with uniform prior. $T(x)=x_1+...+x_m=m_1=t\in\{0,...,m\}$ is a sufficient statistic. Since $\mathcal X=\{0,1\}$ is discrete, $\int dt=\sum_{t=0}^m$ and $\int dx=\sum_{x\in\mathcal X^m}$. In (4) we can choose $g(t|\theta)=p(x|\theta)=\theta^t(1-\theta)^{m-t}$ which implies $h(x)\equiv1$ and $h_\beta(t)=\sum_{x:T(x)=t}1=\binom{m}{t}$. From definition (5) we see that $g(t|D)=p(x|D)$ whose expression can be found in (2). For RKL-distance, Theorem 5 now yields $\text{Loss}_r^m(\Theta|D)=\sum_{t=0}^m h_\beta(t)\,g(t|D)\ln\frac{g(t|D)}{g(t|\Theta)}$. For a point hypothesis $\Theta=\{\theta\}$ this evaluates to a constant minus $m\big[\frac{n_1+1}{n+2}\ln\theta+\frac{n_0+1}{n+2}\ln(1-\theta)\big]$, which is minimized for $\theta=\frac{n_1+1}{n+2}$. Therefore the best predictive point $\hat\theta_r=\frac{n_1+1}{n+2}=\tilde\theta_r=$ Laplace rule, where we have used Theorem 6 in the third equality. ◊
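
The claim that the reverse-KL loss of a point hypothesis is minimized at Laplace's rule can be verified numerically. This is a rough sketch assuming a uniform prior and hypothetical counts $n_1=7$, $n_0=3$, $m=5$; the helper names are illustrative, and a simple grid search stands in for an exact minimization.

```python
import math

def g_data(t, m, n1, n0):
    """g(t|D) = p(x|D) for any sequence with t ones among m (uniform prior)."""
    n = n1 + n0
    log_p = (math.lgamma(t + n1 + 1) + math.lgamma(m - t + n0 + 1) + math.lgamma(n + 2)
             - math.lgamma(n + m + 2) - math.lgamma(n1 + 1) - math.lgamma(n0 + 1))
    return math.exp(log_p)

def rkl_loss(theta, m, n1, n0):
    """Sum over t of comb(m, t) * g(t|D) * ln(g(t|D) / g(t|theta))."""
    total = 0.0
    for t in range(m + 1):
        gd = g_data(t, m, n1, n0)
        log_g_theta = t * math.log(theta) + (m - t) * math.log(1 - theta)
        total += math.comb(m, t) * gd * (math.log(gd) - log_g_theta)
    return total

n1, n0, m = 7, 3, 5                      # hypothetical counts
grid = [i / 1000 for i in range(1, 1000)]
best_theta = min(grid, key=lambda th: rkl_loss(th, m, n1, n0))
laplace = (n1 + 1) / (n1 + n0 + 2)
print(best_theta, laplace)
```

Up to the grid resolution, the minimizer agrees with $(n_1+1)/(n+2)$ independently of $m$.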



5 PHI for ∞-Batch

In this section we will study PHI for large $m$, or more precisely, the $m\gg n$ regime. No assumption is made on the data size $n$, i.e. the results are exact for any $n$ (small or large) in the limit $m\to\infty$. For simplicity and partly by necessity we assume that the $x_{i\in\mathbb N}$ are i.i.d. (lifting the "identical" is possible). Throughout this section we make the following assumptions.

Assumption 7 Let $x_{i\in\mathbb N}$ be independent and identically distributed, $\Omega\subseteq\mathbb R^d$, the likelihood density $p(x_i|\theta)$ twice continuously differentiable w.r.t. $\theta$, and the boundary of $\Theta$ has zero prior probability.

We further define $x:=x_i$ (any $i$) and the partial derivative $\partial:=\partial/\partial\theta=(\partial/\partial\theta_1,...,\partial/\partial\theta_d)^\top=(\partial_1,...,\partial_d)^\top$. The (two representations of the) Fisher information matrix of $p(x|\theta)$

$$I_1(\theta) = E\big[(\partial\ln p(x|\theta))(\partial\ln p(x|\theta))^\top\,\big|\,\theta\big] = -\int(\partial\partial^\top\ln p(x|\theta))\,p(x|\theta)\,dx \qquad (8)$$

will play a crucial role in this Section. It also occurs in Jeffreys' prior,

$$p_J(\theta) := \sqrt{\det I_1(\theta)}\,/\,J, \qquad J := \int\sqrt{\det I_1(\theta)}\,d\theta \qquad (9)$$

a popular reparametrization invariant (objective) reference prior (when it exists) [KW96]. We call the determinant (det) of $I_1(\theta)$ the Fisher information. $J$ can be interpreted as the intrinsic size of $\Omega$ [Grü07]. Although not essential to this work, it will be instructive to occasionally plug it into our expressions. As distance we choose the Hellinger distance.

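Theorem 8 ($\widetilde{\text{Loss}}_h^m(\theta,D)$ for large $m$) Under Assumption 7, for point estimation, the predictive Hellinger loss for large $m$ is

$$\widetilde{\text{Loss}}_h^m(\theta,D) = 2-2\Big(\frac{8\pi}{m}\Big)^{d/2}\frac{p(\theta|D)}{\sqrt{\det I_1(\theta)}}\,[1+o(1)] \ \overset{J}{=}\ 2-2\Big(\frac{8\pi}{m}\Big)^{d/2}\frac{p(D|\theta)}{J\,p(D)}\,[1+o(1)]$$

where the first expression holds for any continuous prior density and the second expression ($\overset{J}{=}$) holds for Jeffreys' prior.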

IMAP. The asymptotic expression shows that minimizing $\widetilde{\text{Loss}}_h^m$ is equivalent to the following maximization

$$\text{IMAP:}\qquad \tilde\theta_h^\infty = \theta^{\rm IMAP} := \arg\max_\theta\frac{p(\theta|D)}{\sqrt{\det I_1(\theta)}} \qquad (10)$$

Without the denominator, this would just be MAP estimation. We have discussed that MAP is not reparametrization invariant, hence can be corrupted by a bad choice of parametrization. Since the square root of the Fisher information transforms like the posterior, their ratio is invariant. So PHI led us to a nice reparametrization invariant variation of MAP, immune to this problem. Invariance of the expressions in Theorem 8 is not a coincidence. It has to hold due to Proposition 4. For Jeffreys' prior (second expression in Theorem 8), minimizing $\widetilde{\text{Loss}}_h^m$ is equivalent to maximizing the likelihood, i.e. $\tilde\theta_h^\infty=\theta^{\rm ML}$. Remember that the expressions are exact even and especially for small samples $D_n$. No large $n$ approximation has been made. For small $n$, MAP, ML, and IMAP can lead to significantly different results. For Jeffreys' prior, IMAP and ML coincide. This is a nice reconciliation of MAP and ML: An "improved" MAP leads for Jeffreys' prior back to "simple" ML.
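
The invariance argument can be illustrated on the Bernoulli model, whose Fisher information is $I_1(\theta)=1/(\theta(1-\theta))$. The sketch below assumes a uniform prior in $\theta$, hypothetical counts $n_1=3$, $n_0=1$, and the (arbitrarily chosen) reparametrization $\phi=\theta^2$; it locates MAP and IMAP by grid search in both parametrizations. The function names are illustrative only.

```python
import math

n1, n0 = 3, 1  # hypothetical data: three ones, one zero

def log_post_theta(theta):
    """Log posterior density in theta under a uniform prior (up to a constant)."""
    return n1 * math.log(theta) + n0 * math.log(1 - theta)

def log_sqrt_fisher_theta(theta):
    """log sqrt(I_1(theta)) for one Bernoulli observation, I_1 = 1/(theta(1-theta))."""
    return -0.5 * math.log(theta * (1 - theta))

# The same two quantities in the parametrization phi = theta^2.
# Both pick up the Jacobian factor |d theta / d phi| = 1 / (2 theta).
def log_post_phi(phi):
    theta = math.sqrt(phi)
    return log_post_theta(theta) - math.log(2 * theta)

def log_sqrt_fisher_phi(phi):
    theta = math.sqrt(phi)
    return log_sqrt_fisher_theta(theta) - math.log(2 * theta)

grid = [i / 10000 for i in range(1, 10000)]

map_theta = max(grid, key=log_post_theta)
imap_theta = max(grid, key=lambda t: log_post_theta(t) - log_sqrt_fisher_theta(t))
map_phi = max(grid, key=log_post_phi)
imap_phi = max(grid, key=lambda p: log_post_phi(p) - log_sqrt_fisher_phi(p))

# Map the phi-estimates back to theta for comparison.
print("MAP :", map_theta, "vs", math.sqrt(map_phi))
print("IMAP:", imap_theta, "vs", math.sqrt(imap_phi))
```

In this example MAP moves from $n_1/n$ to $(n_1-1)/(n-1)$ when switching to $\phi$, whereas IMAP stays at $(n_1+\frac12)/(n+1)$ in both parametrizations.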

MDL. We can also relate PHI to MDL by taking the logarithm of the second expression in Theorem 8:

$$\tilde\theta_h^m \ \overset{J}{\simeq}\ \arg\min_\theta\Big\{-\log p(D|\theta)+\frac d2\log\frac{m}{8\pi}+\log J\Big\} \qquad (11)$$

For $m=4n$, this is the classical (large $n$ approximation of) MDL [Grü07]. So presuming that (11) is a reasonable approximation of PHI even for $m=4n$, MDL approximately minimizes the predictive Hellinger loss iff used for $O(n)$ predictions. We will not expand on this, since the alluded relation to MDL stands on shaky grounds (for several reasons).

Corollary 9 ($\tilde\theta_h^\infty=\theta^{\rm IMAP}\overset{J}{=}\theta^{\rm ML}$) The predictive estimator $\tilde\theta_h^\infty:=\lim_{m\to\infty}\arg\min_\theta\widetilde{\text{Loss}}_h^m(\theta,D)$ coincides with $\theta^{\rm IMAP}$, a representation invariant variation of MAP. In the special case of Jeffreys' prior, it also coincides with the maximum likelihood estimator $\theta^{\rm ML}$.

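Theorem 10 ($\widetilde{\text{Loss}}_h^m(\Theta,D)$ for large $m$) Under Assumption 7, for composite $\Theta$, the predictive Hellinger loss for large $m$ is

$$\widetilde{\text{Loss}}_h^m(\Theta,D) = 2-2\Big(\frac{8\pi}{m}\Big)^{d/4}\frac{1}{\sqrt{P[\Theta]}}\int_\Theta p(\theta|D)\sqrt{\frac{p(\theta)}{\sqrt{\det I_1(\theta)}}}\,d\theta\,[1+o(1)] \ \overset{J}{=}\ 2-2\Big(\frac{8\pi}{m}\Big)^{d/4}\sqrt{\frac{P[\Theta|D]\,p(D|\Theta)}{J\,p(D)}}\,[1+o(1)]$$

where the first expression holds for any continuous prior density and the second expression ($\overset{J}{=}$) holds for Jeffreys' prior.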

MAP meets ML half way. The second expression in Theorem 10 is proportional to the geometric average of the posterior and the composite likelihood. For large $\Theta$ the likelihood gets small, since the average involves many wrong models. For small $\Theta$, the posterior is proportional to the volume of $\Theta$ hence tends to zero. The product is maximal for some $\Theta$ in-between:

$$\text{ML}\times\text{MAP:}\qquad \frac{P[\Theta|D]}{\sqrt{P[\Theta]}} = \frac{p(D|\Theta)\sqrt{P[\Theta]}}{p(D)} \ \longrightarrow\ \begin{cases}1 & \text{for } \Theta\to\Omega\\ 0 & \text{for } \Theta\to\{\theta\}\end{cases} \qquad (12)$$

The regions where the posterior density $p(\theta|D)$ and where the (point) likelihood $p(D|\theta)$ are large are quite similar, as long as the prior is not extreme. Let $\Theta_0$ be this region. It typically has diameter $O(n^{-1/2})$. Increasing $\Theta\supset\Theta_0$ cannot significantly increase $P[\Theta|D]\le1$, but significantly decreases the likelihood, hence the product gets smaller. Vice versa, decreasing $\Theta\subset\Theta_0$ cannot significantly increase $p(D|\Theta)\le p(D|\theta^{\rm ML})$, but significantly decreases the posterior. The value at $\Theta_0$ follows from $P[\Theta_0]\approx\text{Volume}(\Theta_0)\approx O(n^{-d/2})$. Together this shows that $\Theta_0$ approximately maximizes the product of likelihood and posterior. So the best predictive $\tilde\Theta=\Theta_0$ has diameter $O(n^{-1/2})$, which is a very reasonable answer. It covers well but not excessively the high posterior and high likelihood regions (provided $\mathcal H$ is sufficiently rich of course). By multiplying the likelihood or dividing the posterior with only the square root of the prior, they meet half way!
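
The "geometric average" reading can be spelled out in one line. Assuming the quantity being maximized in (12) is $P[\Theta|D]/\sqrt{P[\Theta]}$, Bayes' rule for composite hypotheses, $P[\Theta|D]=p(D|\Theta)\,P[\Theta]/p(D)$, gives:

```latex
\frac{P[\Theta|D]}{\sqrt{P[\Theta]}}
  \;=\; \sqrt{P[\Theta|D]\cdot\frac{P[\Theta|D]}{P[\Theta]}}
  \;=\; \sqrt{P[\Theta|D]\cdot\frac{p(D|\Theta)}{p(D)}}
  \;\propto\; \sqrt{P[\Theta|D]\;p(D|\Theta)}
```

Since $p(D)$ does not depend on $\Theta$, maximizing the left-hand side is the same as maximizing the geometric mean of the posterior probability and the composite likelihood.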



Bernoulli Example. A Bernoulli process with uniform prior and $n_0=n_1$ has posterior variance $\sigma_n^2=\frac{1}{4(n+3)}$. Hence any reasonable symmetric interval estimate $\Theta=[\frac12-z;\frac12+z]$ of $\theta$ will have size $2z=O(n^{-1/2})$. For PHI we get

$$\frac{P[\Theta|D]}{\sqrt{P[\Theta]}} = \frac{1}{\sqrt{2z}}\int_\Theta p(\theta|D)\,d\theta \ \simeq\ \frac{1}{\sqrt{2z}}\,\text{erf}\Big(\frac{z}{\sigma_n\sqrt2}\Big)$$

where equality $\simeq$ is a large $n$ approximation, and erf($\cdot$) is the error function [AS74]. $\text{erf}(x)/\sqrt x$ has a global maximum at $x=1$ within 1% precision. Hence PHI selects an interval of half-width $z=\sqrt2\,\sigma_n$.
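
The location of the maximum of $\text{erf}(x)/\sqrt x$ is easy to confirm numerically. The following sketch uses only the standard library and a plain grid search; the grid range and resolution are arbitrary choices.

```python
import math

def objective(x):
    """erf(x) / sqrt(x), the large-n shape of the ML x MAP criterion in the half-width."""
    return math.erf(x) / math.sqrt(x)

xs = [i / 10000 for i in range(1, 50001)]   # grid on (0, 5]
x_star = max(xs, key=objective)

# Relative loss in the objective from using x = 1 instead of the exact maximizer.
rel_gap = 1 - objective(1.0) / objective(x_star)
print(x_star, rel_gap)
```

The maximizer lies slightly below 1, and evaluating at $x=1$ instead changes the objective only marginally, consistent with the "within 1% precision" statement.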

If faced with a binary decision between point estimate $\Theta_f=\{\frac12\}$ and vacuous estimate $\Theta_v=[0;1]$, comparing the losses in Theorems 8 and 10 we see that for large $m$, $\Theta_v$ is selected, despite $\sigma_n$ being close to zero for large $n$. In Section 2 we have explained that this makes sense from a predictive point of view. ◊

Finally note that (12) does not converge to (any monotone function of) (10) for $\Theta\to\{\theta\}$, since the limits $m\to\infty$ and $\Theta\to\{\theta\}$ do not exchange.

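Finding $\tilde\Theta_h^\infty$. Contrary to MAP and ML, an unrestricted maximization of (12) over all measurable $\Theta\subseteq\Omega$ makes sense. The following result reduces the optimization problem to finding the level sets of the likelihood function and to a one-dimensional maximization problem.

Theorem 11 (Finding $\tilde\Theta_h^\infty=\Theta^{\rm ML\times MAP}$) Let $\Theta_\gamma:=\{\theta:p(D|\theta)\ge\gamma\}$ be the $\gamma$-level set of $p(D|\theta)$. If $P[\Theta_\gamma]$ is continuous in $\gamma$, then

$$\Theta^{\rm ML\times MAP} := \arg\max_\Theta\frac{P[\Theta|D]}{\sqrt{P[\Theta]}} = \arg\max_{\Theta_\gamma:\gamma\ge0}\frac{P[\Theta_\gamma|D]}{\sqrt{P[\Theta_\gamma]}}$$

More precisely, every global maximum of (12) differs from the maximizer $\Theta_\gamma$ at most on a set of measure zero.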

Using posterior level sets, i.e. shortest $\alpha$-credible sets/intervals instead of likelihood level sets would not work (an indirect proof is that they are not RI). For a general prior, $p(D|\theta)\sqrt{p(\theta)/\sqrt{\det I_1(\theta)}}$ level sets need to be considered. The continuity assumption on $P[\Theta_\gamma]$ excludes likelihoods with plateaus, which is restrictive if considering non-analytic likelihoods. The assumption can be lifted by considering all $\Theta_\gamma$ in-between $\Theta_\gamma^o:=\{\theta:p(D|\theta)>\gamma\}$ and $\bar\Theta_\gamma:=\{\theta:p(D|\theta)\ge\gamma\}$. Exploiting the special form of (12) one can show that the maximum is attained for either $\Theta_\gamma^o$ or $\bar\Theta_\gamma$ with $\gamma$ obtained as in the theorem.

Large $n$. For large $n$ ($m\gg n\gg1$), the likelihood usually tends to an (un-normalized) Gaussian with mean=mode $\hat\theta=\theta^{\rm ML}$ and covariance matrix $[nI_1(\hat\theta)]^{-1}$. Therefore the level sets are ellipsoids

$$\Theta_r = \{\theta:(\theta-\hat\theta)^\top I_1(\hat\theta)(\theta-\hat\theta)\le r^2\}$$

We know that the size $r$ of the maximizing ellipsoid scales with $O(n^{-1/2})$. For such tiny ellipsoids, (12) is asymptotically proportional to

$$\frac{P[\Theta_r|D]}{\sqrt{\text{Volume}[\Theta_r]}} \propto \frac{\int_{\Theta_r}p(D|\theta)\,d\theta}{\sqrt{\int_{\Theta_r}d\theta}} \propto \frac{\int_{\|z\|\le\rho}e^{-\|z\|^2/2}\,dz}{\sqrt{\int_{\|z\|\le\rho}dz}} \propto \frac{\int_0^t s^{d/2-1}e^{-s}\,ds}{\rho^{d/2}} = \frac{\gamma(d/2,t)}{\rho^{d/2}}$$

where $z:=\sqrt{nI_1(\hat\theta)}\,(\theta-\hat\theta)\in\mathbb R^d$, and $\rho:=r\sqrt n$, and $t:=\frac12\rho^2$, and $\gamma(\cdot,\cdot)$ is the incomplete Gamma function [AS74], and we dropped all factors that are independent of $r$. The expressions also hold for general prior in Theorem 10, since asymptotically the prior has no influence. They are maximized for the following $\tilde r$:



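| $d$ | 1 | 2 | 3 | 4 | 5 | 10 | 100 | $\infty$ |
|---|---|---|---|---|---|---|---|---|
| $\tilde r\sqrt{n/d}$ | 1.400 | 1.121 | 1.009 | 0.947 | 0.907 | 0.819 | 0.721 | $1/\sqrt2$ |

i.e. for $m\gg n\gg1$, unrestricted PHI selects ellipsoid $\tilde\Theta^\infty=\Theta_{\tilde r}$ of (linear) size $O(\sqrt{d/n})$.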
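
The tabulated radii can be reproduced approximately by maximizing the criterion numerically. This sketch works with the equivalent radial form $\int_0^\rho e^{-s^2/2}s^{d-1}ds\,/\,\rho^{d/2}$ (the Gaussian mass of a ball of radius $\rho$ over the square root of its volume, up to constants), uses trapezoidal integration and a grid search, and reports $\tilde\rho/\sqrt d=\tilde r\sqrt{n/d}$. The function name, step size, and search range are assumptions made for the illustration.

```python
import math

def optimal_scaled_radius(d, steps_per_unit=2000):
    """Grid-search the maximizer rho of  int_0^rho exp(-s^2/2) s^(d-1) ds / rho^(d/2)
    and return rho / sqrt(d)."""
    rho_max = 3.0 * math.sqrt(d) + 3.0          # generous search range
    h = 1.0 / steps_per_unit
    n_steps = int(rho_max * steps_per_unit)
    integral = 0.0
    f_prev = 1.0 if d == 1 else 0.0             # integrand at s = 0
    best_rho, best_val = 0.0, -math.inf
    for i in range(1, n_steps + 1):
        s = i * h
        f_cur = math.exp(-0.5 * s * s) * s ** (d - 1)
        integral += 0.5 * h * (f_prev + f_cur)  # trapezoid rule, accumulated
        f_prev = f_cur
        val = integral / s ** (d / 2)
        if val > best_val:
            best_val, best_rho = val, s
    return best_rho / math.sqrt(d)

for d in (1, 2, 3, 4, 5, 10):
    print(d, round(optimal_scaled_radius(d), 3))
```

For $d=1$ the result matches the half-width $\sqrt2\,\sigma_n$ found in the Bernoulli example.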

So far we have considered $\widetilde{\text{Loss}}_h^m$. Analogous asymptotic expressions can be derived for $\text{Loss}_h^m$: While $\text{Loss}_h^m$ differs from $\widetilde{\text{Loss}}_h^m$, for point estimation their minima $\hat\theta_h^\infty=\tilde\theta_h^\infty=\theta^{\rm IMAP}$ coincide. For composite $\Theta$, the answer is qualitatively similar but differs quantitatively.

6 Large Sample Approximations 

In this section we will study PHI for large sample sizes $n$, more precisely the $n\gg m$ regime. For simplicity we concentrate on the univariate $d=1$ case only. Data may be non-i.i.d.

Sequential moment fitting (SMF). A classical approximation of the posterior 
density p{6\D) is by a Gaussian with same mean and variance. In case the class of 
available distributions is further restricted, it is still reasonable to approximate the 
posterior by the distribution whose mean and variance are closest to that of p{6\D). 
There might be a tradeoff between taking a distribution with good mean (low bias) 
or one with good variance. Often low bias is of primary importance, and variance 
comes second. This suggests to first fit the mean, then the variance, and possibly 
continue with higher order moments. 

PHI is concerned with predictive performance, not with density estimation, but of course they are related. Good density estimation in general and sequential moment fitting (SMF) in particular lead to good predictions, but the converse is not necessarily true. We will indeed see that PHI for $n\to\infty$ (under certain conditions) reduces to an SMF procedure.

The SMF algorithm. In our case, the set of available distributions is given by $\{p(\theta|\Theta):\Theta\in\mathcal H\}$. For some event $A$, let

$$\bar\theta^A := E[\theta|A] = \int\theta\,p(\theta|A)\,d\theta \quad\text{and}\quad \mu_k^A := E[(\theta-\bar\theta^A)^k|A] \quad (k\ge2) \qquad (13)$$

be the mean and central moments of $p(\theta|A)$. The posterior moments $\mu_k^D$ are known and can in principle be computed. SMF sequentially "fits" $\mu_k^\Theta$ to $\mu_k^D$: Starting with $\mathcal H_0:=\mathcal H$, let $\mathcal H_k\subseteq\mathcal H_{k-1}$ be the set of $\Theta\in\mathcal H_{k-1}$ that minimize $|\mu_k^\Theta-\mu_k^D|$:

$$\mathcal H_k := \{\arg\min_{\Theta\in\mathcal H_{k-1}}|\mu_k^\Theta-\mu_k^D|\}, \qquad \mathcal H_0:=\mathcal H, \qquad \mu_1^A:=\bar\theta^A$$

Let $k^*:=\min\{k:\mu_k^\Theta\ne\mu_k^D,\ \Theta\in\mathcal H_k\}$ be the smallest $k$ for which there is no perfect fit anymore (or $\infty$ otherwise). Under some quite general conditions, in a certain sense, all and only the $\Theta\in\mathcal H_{k^*}$ minimize $\text{Loss}_d^m(\Theta,D_n)$ for large $n$.
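
The SMF recursion is short enough to sketch directly. The code below is an illustration under stated assumptions: a Bernoulli model with uniform prior (so the posterior is a Beta distribution and all moments are exact rationals), a finite hypothesis class of intervals $[a;b]$ with $a=b$ allowed for point hypotheses, and a cap `k_max` on the number of moments examined. All names are hypothetical.

```python
from fractions import Fraction
from math import comb

def interval_moments(a, b, k_max):
    """Mean and central moments 2..k_max of the uniform prior restricted to [a, b]
    (a == b gives a point mass)."""
    mean, half = Fraction(a + b) / 2, Fraction(b - a) / 2
    moments = {1: mean}
    for k in range(2, k_max + 1):
        # Uniform on [-half, half]: E[u^k] = half^k / (k+1) for even k, 0 for odd k.
        moments[k] = half ** k / (k + 1) if k % 2 == 0 else Fraction(0)
    return moments

def beta_posterior_moments(n1, n0, k_max):
    """Mean and central moments of the Beta(n1+1, n0+1) posterior."""
    a, b = n1 + 1, n0 + 1
    raw = [Fraction(1)]
    for j in range(k_max):
        raw.append(raw[-1] * Fraction(a + j, a + b + j))   # E[theta^(j+1)]
    mean = raw[1]
    moments = {1: mean}
    for k in range(2, k_max + 1):
        moments[k] = sum(comb(k, j) * raw[j] * (-mean) ** (k - j) for j in range(k + 1))
    return moments

def smf(hypotheses, target, k_max):
    """Sequentially keep the hypotheses whose k-th moment is closest to the target.
    Returns (survivors, k_star); k_star is None if every moment up to k_max fits exactly."""
    survivors = list(hypotheses)
    for k in range(1, k_max + 1):
        gaps = {name: abs(hypotheses[name][k] - target[k]) for name in survivors}
        best = min(gaps.values())
        survivors = [name for name in survivors if gaps[name] == best]
        if best != 0:
            return survivors, k
    return survivors, None

K = 4
H = {"fair": interval_moments(Fraction(1, 2), Fraction(1, 2), K),
     "vacuous": interval_moments(Fraction(0), Fraction(1), K)}
print(smf(H, beta_posterior_moments(50, 50, K), K))
```

With $n_0=n_1=50$ both hypotheses fit the mean exactly, and the point hypothesis wins at the second moment, in line with the Bernoulli discussion that follows.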

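Theorem 12 (PHI for large $n$ by SMF) For some $k\le k^*$, assume $p(x|\theta)$ is $k$ times continuously differentiable w.r.t. $\theta$ at the posterior mean $\bar\theta^D$. Let $\beta>0$ and assume $\sup_\theta\int|p^{(k)}(x|\theta)|^\beta dx<\infty$, $\mu_k^D=O(n^{-k/2})$, $\mu_k^\Theta=O(n^{-k/2})$, and $d(p,q)/|p-q|^\beta$ is a bounded function. Then

$$\text{Loss}_d^m(\Theta,D) = O(n^{-k\beta/2}) \qquad \forall\,\Theta\in\mathcal H_k \quad (k\le k^*)$$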

For the $\alpha\le 1$ distances we have $\beta = 1$, for the square distance we have $\beta = 2$ (see 
Section [3]). For i.i.d. distributions with finite moments, the assumption $\mu_k^D = O(n^{-k/2})$ 
is virtually no restriction. Normally, no $\tilde\Theta\in\mathcal{H}$ has better loss order than $O(n^{-k^*\beta/2})$, i.e. $\mathcal{H}_{k^*}$ 
can be regarded as the set of all asymptotically optimal predictors. In many cases, 
$\mathcal{H}_{k^*}$ contains only a single element. Note that $\mathcal{H}_{k^*}$ depends neither on $m$ nor 
on the chosen distance $d$, i.e. the best predictive hypothesis $\hat\Theta = \hat\Theta^m_d$ is essentially the 
same for all $m$ and $d$ if $n$ is large. 

Bernoulli Example. In the Bernoulli Example in Section [2] we considered a binary 
decision between point estimate $\Theta_f = \{\frac12\}$ and vacuous estimate $\Theta_v = [0;1]$, i.e. 
$\mathcal{H}_0 = \{\Theta_f,\Theta_v\}$. For $n_0 = n_1$ we have $\bar\theta^{[0;1]} = \bar\theta^{\{1/2\}} = \frac12 = \bar\theta^D$, i.e. both fit the first moment 
exactly, hence $\mathcal{H}_1 = \mathcal{H}_0$. For the second moments we have $\mu_2^D \approx \frac{1}{4n}$, but $\mu_2^{[0;1]} = \frac{1}{12}$ and 
$\mu_2^{\{1/2\}} = 0$, hence for large $n$ the point estimate matches the posterior variance better, 
so $\hat\Theta = \{\frac12\}\in\mathcal{H}_2 = \{\Theta_f\}$, which makes sense. ♦ 
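
The SMF recursion and the Bernoulli example can be illustrated numerically. The 
following Python fragment is only a minimal sketch under stated assumptions: the 
hypothesis class is a finite list, the prior is uniform (so the posterior is 
Beta(n1+1, n0+1)), and only the first two moments are supplied. The names `smf`, 
`moment`, `beta_posterior_moments` and the tolerance `tol` are hypothetical and 
do not come from the paper. 

```python
from typing import Callable, Hashable, List, Optional, Sequence, Tuple


def smf(
    hypotheses: Sequence[Hashable],
    moment: Callable[[Hashable, int], float],
    posterior_moments: Sequence[float],
    tol: float = 1e-12,
) -> Tuple[List[Hashable], Optional[int]]:
    """Sequential moment fitting over a finite list of candidate hypotheses.

    moment(h, k) is the k-th moment of p(theta | h): the mean for k = 1 and
    the k-th central moment for k >= 2. posterior_moments[k - 1] is the
    corresponding posterior moment. Returns the surviving set and k*, the
    first k without a perfect fit (None if every supplied moment is matched).
    """
    survivors = list(hypotheses)
    for k, target in enumerate(posterior_moments, start=1):
        gaps = [abs(moment(h, k) - target) for h in survivors]
        best = min(gaps)
        # Keep only the hypotheses whose k-th moment is closest to the posterior's.
        survivors = [h for h, g in zip(survivors, gaps) if g <= best + tol]
        if best > tol:  # no perfect fit any more: this k is k*
            return survivors, k
    return survivors, None


def beta_posterior_moments(n1: int, n0: int) -> Tuple[float, float]:
    """Mean and variance of Beta(n1 + 1, n0 + 1) (assumed uniform prior)."""
    a, b = n1 + 1.0, n0 + 1.0
    return a / (a + b), a * b / ((a + b) ** 2 * (a + b + 1.0))


# Mean and variance of p(theta | h) for the two candidate hypotheses.
MOMENTS = {
    "fair": (0.5, 0.0),            # point estimate {1/2}
    "vacuous": (0.5, 1.0 / 12.0),  # uniform on [0; 1]
}


def moment(h: str, k: int) -> float:
    return MOMENTS[h][k - 1]


result = smf(["fair", "vacuous"], moment, beta_posterior_moments(50, 50))
print(result)  # (['fair'], 2)
```

With n0 = n1 = 50 both candidates match the mean exactly, the point estimate is 
closer in variance, and the first imperfect fit occurs at k = 2. 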

For unrestricted (single) point estimation, i.e. $\mathcal{H} = \{\{\theta\} : \theta\in\mathbb{R}\}$, one can typically 
estimate the mean exactly but no higher moments. More generally, finite mixture 
models $\tilde\Theta = \{\theta_1,...,\theta_l\}$ with $l$ components (degrees of freedom) can fit at most $l$ 
moments. For large $l$, the number of $\theta_i\in\hat\Theta$ that lie in a small neighborhood of some $\theta$ 
(i.e. the "density" of points in $\hat\Theta$ at $\theta$) will be proportional to the likelihood $p(D|\theta)$. 
Countably infinite and, even more so, continuous models, if otherwise unrestricted, are 
sufficient to get all moments right. If the parameter range is restricted, anything can 
happen ($k^* = \infty$ or $k^* < \infty$). For interval estimation $\mathcal{H} = \{[a;b] : a,b\in\mathbb{R}, a<b\}$ and 
uniform prior, we have $\bar\theta^{[a;b]} = \frac12(a+b)$ and $\mu_2^{[a;b]} = \frac{1}{12}(b-a)^2$, hence the first two moments 
can be fitted exactly and the SMF algorithm yields the unique asymptotic solution 
$\hat\Theta = [\bar\theta^D-\sqrt{3\mu_2^D};\ \bar\theta^D+\sqrt{3\mu_2^D}]$. In higher dimensions, common choices of $\mathcal{H}$ are convex 
sets, ellipsoids, and hypercubes. For ellipsoids, the mean and covariance matrix can 
be fitted exactly and uniquely, similarly to 1d interval estimation. While SMF can be 
continued beyond $k^*$, $\mathcal{H}_k$ typically does not contain $\hat\Theta$ for $k>k^*$ anymore. The correct 
continuation beyond $k^*$ is either $\mathcal{H}_{k+1} = \{\arg\min_{\tilde\Theta\in\mathcal{H}_k}\mu_{k+1}^{\tilde\Theta}\}$ or $\mathcal{H}_{k+1} = \{\arg\max_{\tilde\Theta\in\mathcal{H}_k}\mu_{k+1}^{\tilde\Theta}\}$ 
(there is some criterion for the choice), but apart from exotic situations this does not 
improve the order $O(n^{-k^*\beta/2})$ of the loss, and usually $|\mathcal{H}_{k^*}| = 1$ anyway. 
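
The interval solution above can be checked numerically. The Python fragment below 
is a minimal sketch, assuming a Bernoulli model with uniform prior (posterior 
Beta(n1+1, n0+1)) and illustrative counts for which the resulting interval stays 
inside the parameter range [0;1]. All function names are hypothetical. 

```python
import math
from typing import Tuple


def beta_posterior_moments(n1: int, n0: int) -> Tuple[float, float]:
    """Mean and variance of Beta(n1 + 1, n0 + 1) (assumed uniform prior)."""
    a, b = n1 + 1.0, n0 + 1.0
    return a / (a + b), a * b / ((a + b) ** 2 * (a + b + 1.0))


def smf_interval(mean: float, var: float) -> Tuple[float, float]:
    """Interval [mean - sqrt(3 var); mean + sqrt(3 var)] fitting both moments."""
    half_width = math.sqrt(3.0 * var)
    return mean - half_width, mean + half_width


def uniform_moments(lo: float, hi: float) -> Tuple[float, float]:
    """Mean and variance of the uniform distribution on [lo; hi]."""
    return 0.5 * (lo + hi), (hi - lo) ** 2 / 12.0


# Illustrative counts: 30 ones and 70 zeros.
post_mean, post_var = beta_posterior_moments(30, 70)
lo, hi = smf_interval(post_mean, post_var)
fit_mean, fit_var = uniform_moments(lo, hi)

print(f"posterior mean={post_mean:.4f}, variance={post_var:.6f}")
print(f"SMF interval=[{lo:.4f}; {hi:.4f}]")
print(f"uniform fit mean={fit_mean:.4f}, variance={fit_var:.6f}")
```

The uniform distribution on the returned interval reproduces the posterior mean 
and variance up to rounding, which is the exact two-moment fit described above. 
For very small samples the interval may leave [0;1]. This is the restricted-range 
case, which the sketch does not handle. 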

Exploiting Theorem [6], we see that SMF is also applicable for $\widetilde{\mathrm{Loss}}^m_2$ and $\widetilde{\mathrm{Loss}}^m_r$. 
Luckily, Offline PHI can also be reduced to 1-Batch PHI: 
Proposition 13 (Offline = 1-Batch) If $x_{i\in\mathbb{N}}$ are i.i.d., the Offline Loss is 
proportional to the 1-Batch Loss: 

$$\mathrm{Loss}^{1m}_d(\tilde\Theta,D) := \sum_{k=n}^{n+m-1}\int \mathrm{Loss}^1_d(\tilde\Theta,D_k)\, p(x_{n+1:k}|D)\, dx_{n+1:k} = m\,\mathrm{Loss}^1_d(\tilde\Theta,D)$$

In particular, Offline PHI equals 1-Batch PHI: $\hat\Theta^{1m}_d = \hat\Theta^1_d$. 

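Exploiting Theorem [6] we see that also $\widetilde{\mathrm{Loss}}^{1m}_{2|r} = m\,\widetilde{\mathrm{Loss}}^1_{2|r}$ + constant. Hence we can 
apply SMF also for Offline $\mathrm{PHI}^{1m}_{2|r}$ and $\widetilde{\mathrm{PHI}}^{1m}_{2|r}$. For square loss, i.i.d. is not essential, 
independence is sufficient. 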



7 Discussion 



Summary. If prediction is the goal, but full Bayes is not feasible, one should identify 
(estimate/test/select) the hypothesis (parameter/model/interval) that predicts best. 
What is best can depend on the problem setup: what our benchmark is (Loss, $\widetilde{\mathrm{Loss}}$), 
the distance function we use for comparison ($d$), how long we use the model ($m$) 
compared to how much data we have at hand ($n$), and whether we continue to learn 
or not (Batch, Offline). We have defined some reparametrization and representation 
invariant losses that cover many practical scenarios. Predictive hypothesis 
identification (PHI) aims at minimizing this loss. For $m\to\infty$, PHI overcomes some problems 
of and even reconciles (a variation of) MAP and (composite) ML. Asymptotically, 
for $n\to\infty$, PHI reduces to a sequential moment fitting (SMF) procedure, which is 
independent of $m$ and $d$. The primary purpose of the asymptotic approximations was 
to gain understanding (e.g. consistency of PHI follows from them), without supposing 
that they are the most relevant in practice. A case where PHI can be evaluated 
efficiently and exactly is when a sufficient statistic is available. 

Outlook. There are many open ends and possible extensions that deserve further 
study. Some results have only been proven for specific distance functions. 
For instance, we conjecture that PHI reduces to IMAP for other $d$ (seems true 
for $\alpha$-distances). Definitely the behavior of PHI should next be studied for 
semi-parametric models and compared to existing model (complexity) selectors like AIC, 
LoRP [Hut07], BIC, and MDL [Grü07], and to cross validation in the supervised case. 
Another important generalization still to be done is to supervised learning (classification 
and regression), which (likely) requires a stochastic model of the input variables. PHI 
could also be generalized to predictive density estimation proper by replacing $p(x|\tilde\Theta)$ 
with a (parametric) class of densities $q_\vartheta(x)$. Finally, we could also go the full way 
to a decision-theoretic setup and loss. Note that Theorems [8] and [12] combined with 
(asymptotic) frequentist properties like consistency of MAP/ML/SMF easily yield 
analogous results for PHI. 
Conclusion. We have shown that predictive hypothesis identification scores well 
on all desirable properties listed in Section [3]. In particular, PHI can properly deal 
with nested hypotheses, and nicely justifies, reconciles, and blends MAP and ML for 
$m\gg n$, MDL for $m\approx n$, and SMF for $n\gg m$. 

Acknowledgements. Many thanks to Jan Poland for his help improving the clarity 
of the presentation. 